Qualitative Variables

A two-valued qualitative variable can be represented by a single 0-or-1-valued "dummy" variable. If a qualitative variable has three or more possible values (e.g., make-of-car, or marital-status), choose one value as the "foundation" case, and create one 0-or-1-valued "difference" variable for each other value. (The coefficient of each difference variable represents the difference between the associated value and the foundation case, with the other explanatory variables held fixed.)

Example: To include automobile-make (with the values Ford, Honda, BMW, and Sterling) in the maintenance-and-repair-cost model, create three new variables:

	D_Honda-Ford	D_BMW-Ford	D_Sterling-Ford
Ford	0	0	0	foundation case
Honda	1	0	0
BMW	0	1	0
Sterling	0	0	1
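
The encoding in the table can be sketched in a few lines of Python. This is only an illustration: the names MAKES, FOUNDATION, and encode_make are hypothetical, not part of any statistics package.

```python
# Hypothetical helper: encode a car's make as k-1 = 3 difference variables,
# using Ford as the foundation case (as in the table above).
MAKES = ["Ford", "Honda", "BMW", "Sterling"]
FOUNDATION = "Ford"


def encode_make(make):
    """Return the 0/1 difference variables in the order Honda, BMW, Sterling."""
    if make not in MAKES:
        raise ValueError(f"unknown make: {make!r}")
    # One variable per non-foundation make; the foundation case gets all zeros.
    return [1 if make == other else 0 for other in MAKES if other != FOUNDATION]


for make in MAKES:
    print(make, encode_make(make))
```

Each car gets at most one 1 among its three difference variables, and a Ford gets none.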

The regression model will look like:

Cost = α + β₁Mileage + β₂Age + β₃D_H-F + β₄D_B-F + β₅D_S-F + ε

which amounts to four parallel models, one per make, estimated from the same sample. They share the same Mileage and Age coefficients and differ only in their constant term:

Ford:	Cost = α + β₁Mileage + β₂Age + ε
Honda:	Cost = α + β₁Mileage + β₂Age + β₃ + ε
BMW:	Cost = α + β₁Mileage + β₂Age + β₄ + ε
Sterling:	Cost = α + β₁Mileage + β₂Age + β₅ + ε
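
To make the "same slopes, shifted constant" reading concrete, here is a small sketch with made-up coefficients. The numbers are assumptions chosen for illustration, not estimates from any data, and expected_cost is a hypothetical helper.

```python
# Illustrative coefficients only (assumed values, not fitted estimates).
ALPHA = 100.0         # constant term, which is the Ford intercept
BETA_MILEAGE = 0.02   # β₁
BETA_AGE = 50.0       # β₂
# Shift added to the constant for each make: 0 for Ford, then β₃, β₄, β₅.
MAKE_SHIFT = {"Ford": 0.0, "Honda": -40.0, "BMW": 150.0, "Sterling": 220.0}


def expected_cost(make, mileage, age):
    """Expected cost (error term omitted) for a car of the given make."""
    return ALPHA + BETA_MILEAGE * mileage + BETA_AGE * age + MAKE_SHIFT[make]


# For any fixed mileage and age, Honda minus Ford is just β₃.
for mileage, age in [(10000, 1), (30000, 3), (80000, 9)]:
    gap = expected_cost("Honda", mileage, age) - expected_cost("Ford", mileage, age)
    print(mileage, age, round(gap, 6))
```

Whatever mileage and age are plugged in, the gap between two makes stays the same, which is exactly what the coefficient of a difference variable measures.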

A natural question: When the qualitative variable takes k categorical values, why don't we simply encode it with k "yes/no" variables? Conceptually, this would work. Practically, it fails inside the regression "machine": for every observation the k yes/no variables sum to exactly 1, which duplicates the constant term, so there is no longer a unique set of best-fitting regression coefficients. This is why the "one foundation case plus k-1 dummy variables" approach is required.
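
The loss of uniqueness can be seen without running a regression at all. The sketch below uses made-up coefficients and a hypothetical predict function, with the Mileage and Age terms left out for brevity. It shows two different coefficient sets for the "constant plus k yes/no variables" encoding that give identical predictions for every make.

```python
MAKES = ["Ford", "Honda", "BMW", "Sterling"]


def predict(intercept, make_coefs, make):
    """Constant plus one yes/no variable per make (other terms omitted)."""
    return intercept + sum(
        coef * (1 if make == m else 0) for m, coef in make_coefs.items()
    )


# Two different coefficient sets (assumed numbers). The second moves 100
# out of the constant and into every yes/no coefficient.
set_a = (100, {"Ford": 0, "Honda": -40, "BMW": 150, "Sterling": 220})
set_b = (0, {"Ford": 100, "Honda": 60, "BMW": 250, "Sterling": 320})

for make in MAKES:
    # Exactly one yes/no variable is 1, so the shift cancels out every time.
    print(make, predict(*set_a, make), predict(*set_b, make))
```

Since any amount can be moved between the constant and the k coefficients without changing a single prediction, no sample can tell such sets apart. Dropping one yes/no variable (the foundation case) removes this freedom.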

The "dummy variable" encoding trick comes at a cost: two of the classical regression statistics become difficult to interpret.

The first is the beta weight. The beta weights of the original explanatory variables can be compared as before. However, there is no simple way to directly interpret the beta weights of the dummy variables. (Fortunately, this cost is typically more than compensated by having a more accurate model.)

The other regression statistic that becomes difficult to interpret after this one-into-many process is the significance level. If some of the dummy variables have large significance levels, and others are close to 0%, what should you do? You cannot exclude some and include others, since together they represent a single real variable. Instead, you want to test the null hypothesis "H₀: The qualitative variable does not 'belong' in the model (i.e., all of the coefficients of the dummy variables are 0)." This can be done via a technique known as "analysis of variance" (ANOVA).
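
One standard form of this joint test is the partial F-test. It compares the sum of squared errors of the model fitted without the dummy variables against that of the model fitted with all of them. The sketch below only computes the test statistic. The function name, the argument names, and the numbers in the example are all assumptions for illustration.

```python
def partial_f_statistic(sse_reduced, sse_full, num_dummies, resid_df_full):
    """F statistic for H0: all dummy-variable coefficients are 0.

    sse_reduced   -- sum of squared errors of the model WITHOUT the dummies
    sse_full      -- sum of squared errors of the model WITH the dummies
    num_dummies   -- number of dummy variables being tested (k - 1)
    resid_df_full -- residual degrees of freedom of the full model
                     (sample size minus number of estimated coefficients)
    """
    if sse_full <= 0 or num_dummies <= 0 or resid_df_full <= 0:
        raise ValueError("sse_full, num_dummies and resid_df_full must be positive")
    # Drop in error per dummy variable, relative to the full model's error variance.
    return ((sse_reduced - sse_full) / num_dummies) / (sse_full / resid_df_full)


# Made-up example: 50 cars, full model with 6 coefficients (α, β₁, ..., β₅),
# so 50 - 6 = 44 residual degrees of freedom; 3 dummy variables under test.
f_stat = partial_f_statistic(sse_reduced=1200.0, sse_full=900.0,
                             num_dummies=3, resid_df_full=44)
print(round(f_stat, 4))
```

The statistic is then compared against an F distribution with (k-1, residual) degrees of freedom. A large value is evidence against H₀, so the qualitative variable as a whole stays in the model, whatever the individual significance levels of its dummy variables look like.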